
Add four-nines reliability framework with circuit breakers, DLQ, and monitoring#2

Merged
eighteyes merged 12 commits into main from claude/improve-reliability-four-nines-K38rW
Mar 11, 2026

Conversation


@eighteyes eighteyes commented Mar 11, 2026

Summary

Implements a comprehensive reliability module (src/reliability/) providing four-nines (99.99%) patterns for TX mesh execution. This adds production-grade failure handling, recovery, and monitoring capabilities.

Key Changes

Core Reliability Components:

  • ReliabilityManager (reliability-manager.ts): Central coordinator integrating all reliability features with dispatcher bindings for kill/requeue actions
  • DeadLetterQueue (dead-letter-queue.ts): Session-aware failure recovery with two modes—session resume (preserves conversation context) and message requeue (for undeliverable messages)
  • CircuitBreaker (circuit-breaker.ts): Per-agent failure isolation with closed/open/half-open states, SQLite persistence, and configurable thresholds
  • HeartbeatMonitor (heartbeat-monitor.ts): Detects stalled workers via output silence detection (warn/stale/dead levels) with automatic worker killing
  • SLITracker (sli-tracker.ts): Service level indicator tracking with success rates, MTTR calculation, and failure taxonomy (model_error, routing_error, timeout, stuck, etc.)
  • SafeMode (safe-mode.ts): Gradual autonomy control with four levels (normal/cautious/restricted/lockdown) that restrict tool access based on SLI thresholds
  • CheckpointLog (checkpoint-log.ts): FSM state transition checkpoints for recovery rewinding to named states
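As a rough sketch of the closed/open/half-open lifecycle the CircuitBreaker component provides (thresholds and method names here are illustrative, not the module's exact API; the real implementation also persists state to SQLite):

```typescript
type CircuitState = 'closed' | 'open' | 'half_open';

class CircuitBreakerSketch {
  private state: CircuitState = 'closed';
  private failures = 0;
  private openedAt = 0;

  constructor(
    private failureThreshold = 5, // consecutive failures before opening
    private cooldownMs = 30_000,  // how long to stay open before probing
  ) {}

  canExecute(now = Date.now()): boolean {
    if (this.state === 'open') {
      // After the cooldown, transition to half-open and allow a probe
      if (now - this.openedAt >= this.cooldownMs) {
        this.state = 'half_open';
        return true;
      }
      return false;
    }
    return true; // closed and half_open both allow execution
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = 'closed';
  }

  recordFailure(now = Date.now()): void {
    this.failures++;
    // A failed probe, or too many failures while closed, opens the circuit
    if (this.state === 'half_open' || this.failures >= this.failureThreshold) {
      this.state = 'open';
      this.openedAt = now;
    }
  }
}
```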

Integration:

  • Dispatcher now initializes ReliabilityManager and binds kill/requeue callbacks
  • New CLI commands: tx mesh health [mesh] (SLI dashboard), tx mesh dlq [mesh] (DLQ entries), tx mesh recover [mesh] (recovery workflow)
  • Configuration via .ai/tx/data/config.yaml with per-component thresholds

Test Meshes:

  • reliability-test/: Simple 3-agent mesh (planner→worker→checker) for testing basic reliability features
  • reliability-fsm/: FSM-based mesh (analyze→build→verify) for testing state machine integration and iteration tracking

Documentation:

  • docs/reliability.md: Comprehensive guide mapping each "nine" to TX mechanisms with feature tables and configuration examples
  • Updated docs/guardrails.md with reliability configuration section
  • Updated src/prompt/core.ts with recovery workflow guidance (diagnose → present options → get approval)

Implementation Details

  • Session-aware recovery: DLQ captures sessionId at failure time, enabling resume that preserves full conversation history rather than raw message replay
  • Persistent circuit breaker state: Checkpoints to SQLite for recovery across restarts
  • Configurable thresholds: All timeouts, failure counts, and SLI escalation points configurable via YAML
  • Failure taxonomy: Categorizes failures (model_error, routing_error, stuck, circuit_open, etc.) for targeted diagnostics
  • Safe mode auto-escalation: Optional automatic restriction of agent capabilities when SLI drops below thresholds
  • Per-mesh isolation: Circuit breakers, safe mode levels, and checkpoints tracked per mesh/agent for failure isolation
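The session-aware recovery decision described above can be sketched roughly as follows (field and type names are illustrative, not the DLQ's exact schema; the real implementation persists entries to SQLite):

```typescript
type FailureCategory =
  | 'model_error' | 'routing_error' | 'timeout' | 'stuck' | 'circuit_open';

type RecoveryMode = 'session_resume' | 'requeue' | 'manual';

interface DlqEntrySketch {
  mesh: string;
  agent: string;
  sessionId?: string;        // captured at failure time, if a session existed
  category: FailureCategory;
  retryCount: number;
  maxRetries: number;
}

// Pick a recovery strategy: resume the session when one was captured
// (preserving conversation history), requeue when retries remain,
// otherwise hold the entry for manual human review.
function chooseRecoveryMode(e: DlqEntrySketch): RecoveryMode {
  if (e.sessionId) return 'session_resume';
  if (e.retryCount < e.maxRetries) return 'requeue';
  return 'manual';
}
```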

https://claude.ai/code/session_012PTx7bZNNh74rAshCcuSQg

claude added 12 commits March 11, 2026 06:21
…t breakers, SLI tracking, and safe mode

Implements Karpathy's "March of Nines" patterns for TX mesh reliability:
- Dead Letter Queue: Failed messages persist for replay instead of silent drops
- Circuit Breaker: Per-agent failure isolation prevents cascading failures
- Heartbeat Monitor: Detects stalled workers at warn/stale/dead thresholds
- SLI Tracker: Measures success rates, MTTR, and failure taxonomy per mesh
- Safe Mode: Gradual autonomy control (normal/cautious/restricted/lockdown)
- ReliabilityManager: Single integration point wired into dispatcher

Includes two test meshes (reliability-test, reliability-fsm) and updated guardrails docs.

… breaker checkpointing

- DLQ: replayOne(), replayAll(), replayForAgent() — re-injects failed messages
  back into the live system through SystemMessageWriter with [DLQ REPLAY] prefix
  and original failure context
- Circuit Breaker: SQLite checkpointing — persists open/half_open circuit states
  to circuit_breaker_checkpoints table, restores on restart so agents that were
  failing before a crash stay circuit-broken
- HeartbeatMonitor: Fix NodeJS.Timeout type to ReturnType<typeof setInterval>
- ReliabilityManager: Expose replayDLQ(), replayDLQEntry(), replayDLQForAgent()
  and pass DB to CircuitBreaker constructor for persistence

…replay

Replace naive message replay with session-aware recovery that preserves
conversation history. DLQ now captures sessionId at failure time and uses
RecoveryMode (session_resume/requeue/manual) to determine the right
recovery strategy. When a worker crashes mid-work with an active session,
recovery resumes the SDK session instead of replaying a raw message.

- Rewrite DLQ schema with session_id, recovery_mode, failure_category
- Update ReliabilityManager with session-aware deadLetter() and recover*() APIs
- Wire dispatcher error handler to capture sessionId and route exhausted
  retries to DLQ with full session context
- Export RecoveryMode, RecoveryResult, FailureContext types from index

Surface reliability internals via CLI:

- tx mesh health [mesh] — SLI nines dashboard with success rate,
  MTTR, failure categories, circuit breaker states, safe mode level,
  agent health, and DLQ summary. Per-mesh/per-agent breakdown when
  mesh name provided.

- tx mesh dlq [mesh] — List pending dead letter queue entries with
  recovery mode (session_resume/requeue/manual), failure context,
  retry counts, and session hints.

- tx mesh dlq clear — Garbage collect recovered DLQ entries.

Both support --json for programmatic consumption.
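The nines/MTTR math behind the health dashboard can be sketched as follows (helper names are assumptions for illustration, not the SLI tracker's real API):

```typescript
// Convert a success rate into a "nines" count, e.g. 0.99 → 2, 0.9999 → 4.
// The small epsilon guards against floating-point error at exact boundaries.
function nines(successRate: number): number {
  if (successRate >= 1) return Infinity;
  return Math.floor(-Math.log10(1 - successRate) + 1e-9);
}

// MTTR: mean time between a failure event and its recovery, in ms.
function mttr(recoveryDurationsMs: number[]): number {
  if (recoveryDurationsMs.length === 0) return 0;
  return recoveryDurationsMs.reduce((a, b) => a + b, 0) / recoveryDurationsMs.length;
}
```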

Every reliability feature now has actuation, not just observation:

Heartbeat dead → kill:
  ReliabilityManager.bindDispatcher() receives killAgent callback.
  When heartbeat fires 'dead', it kills the stuck worker via
  AbortController.abort() and records the failure.
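The silence-threshold escalation can be sketched as follows (thresholds match the defaults documented in this PR; the class shape and callback wiring are illustrative):

```typescript
type HeartbeatLevel = 'ok' | 'warn' | 'stale' | 'dead';

class HeartbeatSketch {
  private lastOutput = new Map<string, number>();

  constructor(
    private warnMs = 60_000,    // no output: log, could be thinking
    private staleMs = 120_000,  // no output: callers may nudge
    private deadMs = 300_000,   // no output: kill + route to DLQ
    private onDead: (agent: string) => void = () => {},
  ) {}

  recordOutput(agent: string, now = Date.now()): void {
    this.lastOutput.set(agent, now);
  }

  // Classify an agent's silence and fire the dead callback when crossed.
  check(agent: string, now = Date.now()): HeartbeatLevel {
    const last = this.lastOutput.get(agent) ?? now;
    const silence = now - last;
    if (silence >= this.deadMs) { this.onDead(agent); return 'dead'; }
    if (silence >= this.staleMs) return 'stale';
    if (silence >= this.warnMs) return 'warn';
    return 'ok';
  }
}
```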

DLQ recovery (3 trigger paths):
  1. Automatic on startup — dispatcher calls recoverAll()
  2. CLI — tx mesh recover <mesh> sends SIGUSR2 to dispatcher
  3. Front-matter — message with recover: true triggers recovery
  Session resume: writes message with session-id front-matter so
  dispatcher spawns worker resuming the SDK conversation.
  Requeue: re-injects original message via SystemMessageWriter.

Safe mode enforcement:
  createSafeModeHook() returns a PreToolUse hook (same pattern as
  write-gate) that blocks Write/Edit/Bash at restricted+ levels.
  Hook is registered per-agent at spawn time.
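A minimal sketch of the tool-blocking gate behind this hook (the real createSafeModeHook() integrates with the agent SDK's PreToolUse mechanism; the exact blocked-tool sets and return shape here are illustrative assumptions):

```typescript
type SafeModeLevel = 'normal' | 'cautious' | 'restricted' | 'lockdown';

// Illustrative policy: tools blocked at each level (assumed, not the real config).
const BLOCKED_AT: Record<SafeModeLevel, Set<string>> = {
  normal: new Set(),
  cautious: new Set(),
  restricted: new Set(['Write', 'Edit', 'Bash']),
  lockdown: new Set(['Write', 'Edit', 'Bash', 'WebFetch']),
};

// Given the current level, return a predicate a PreToolUse-style hook
// can use to allow or block each tool call.
function createSafeModeGate(level: SafeModeLevel) {
  return (toolName: string): { allow: boolean; reason?: string } =>
    BLOCKED_AT[level].has(toolName)
      ? { allow: false, reason: `safe mode '${level}' blocks ${toolName}` }
      : { allow: true };
}
```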

SIGUSR2 dlq-recover control signal in start.ts.
tx mesh recover CLI with SIGUSR2 + message fallback.
Test mesh config with tight thresholds for quick testing.
docs/reliability.md — complete guide for all features.

Adds the ability to rewind recovery to any FSM state checkpoint,
not just the crash point. Core agent can now say "rewind-to: build"
to skip failed work and resume from a known-good state.

Checkpoint log (SQLite):
  Saves session IDs at every FSM state transition in the dispatcher's
  onWorkerComplete handler. Key: mesh_name + state_name → session_id.
  Lookup, list, GC, and clear operations.
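An in-memory sketch of that keying scheme (the real CheckpointLog is SQLite-backed; method names are illustrative):

```typescript
// Stand-in for the checkpoint table:
// key mesh_name + state_name → session ID of the last completed transition.
class CheckpointLogSketch {
  private checkpoints = new Map<string, string>();

  private key(mesh: string, state: string): string {
    return `${mesh}:${state}`;
  }

  // Called on each FSM state transition; later saves overwrite earlier ones.
  save(mesh: string, state: string, sessionId: string): void {
    this.checkpoints.set(this.key(mesh, state), sessionId);
  }

  // Used by rewind-to recovery to find a known-good session to resume.
  lookup(mesh: string, state: string): string | undefined {
    return this.checkpoints.get(this.key(mesh, state));
  }

  list(mesh: string): Array<{ state: string; sessionId: string }> {
    return [...this.checkpoints.entries()]
      .filter(([k]) => k.startsWith(`${mesh}:`))
      .map(([k, sessionId]) => ({ state: k.split(':')[1], sessionId }));
  }
}
```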

rewind-to front-matter:
  recover: true + rewind-to: <state> on a message looks up the
  checkpoint for that state and uses its session ID instead of
  the DLQ entry's crash-point session.

Three trigger paths:
  1. CLI: tx mesh recover <mesh> --rewind-to=build
  2. Message: recover: true + rewind-to: build front-matter
  3. SIGUSR2: {"action":"dlq-recover","mesh":"x","rewindTo":"build"}

tx mesh recover now shows available checkpoints before recovering.

Core prompt updated with Reliability & Recovery section teaching
the agent how to use recover, rewind-to, and check health.

mesh-builder skill updated with reliability front-matter fields.
docs/reliability.md updated with checkpoint log docs.

Core prompt now requires a 4-step workflow for recovery:
1. Diagnose (tx mesh health + tx mesh dlq)
2. Present options to user (resume vs rewind vs drop + checkpoints)
3. Get explicit confirmation
4. Execute chosen recovery

Never triggers recovery silently. The user sees what failed, why,
and which checkpoints are available before approving any action.

docs/reliability.md updated to match: human-initiated recovery
is the preferred path, automatic recovery only on startup crash.

Each reliability priority now has explicit human review steps:
1. Checkpoints + replay: checkpoint notification, replay approval, post-replay review
2. Metrics + tracking: threshold alerts, safe mode escalation/de-escalation approval
3. Retry-with-variation: failure notification, variation transparency, exhaustion review
4. Schema validation: failure notification, correction approval, partial pass handling
5. Agent classification: classification review, non-critical failure reporting, promotion decisions
6. Observability dashboard: anomaly alerts, trend review, cost gates, weekly digest

Core principle: "The system does work. The human makes decisions."
Core prompt updated with condensed human review gates checklist.

…gates

Documents all existing reliability features organized by nines level:
- Nine 1 (90%): SQLite WAL, worker retries, injection retries, routing correction
- Nine 2 (99%): Parity gate, FSM validation, mesh validator, identity gate, write gate
- Nine 2.5: Nudge detector, deadlock breaker, stale cleaner, quality iteration loops
- Nine 3 (99.9%): Circuit breaker, heartbeat, DLQ, SLI tracker, safe mode, checkpoints
- Nine 4 (99.99%): Roadmap items with human review gates

Each level includes a feature table (what/where) and explicit human review steps.

Adds features found across the codebase organized by nines level:
- Nine 1: graceful shutdown, usage policy recovery, recovery handler escalation
- Nine 2: manifest validator, guardrail config chain
- Nine 2.5: session suspend/resume, FSM persistence + backup, session store backfill
- Nine 3: rate limiter, worker pool backpressure, metrics aggregator, worker lifecycle tracking

Bash guard (write-gate.ts createBashHook) intercepts Bash redirects
(>, >>, tee) and validates target paths against write manifest.
Strike system: 1-2 errors with paths, 3+ kills worker.

… review

- Add summary table to Nine 3 (matching Nine 1/2/2.5 format)
- Add detailed explanations for all Nine 1/2/2.5 features
- Extract all human review gates to dedicated HUMAN_REVIEW.md
- Restructure roadmap into table + explanations


Copilot AI left a comment


Pull request overview

Adds a new src/reliability/ module and integrates it into TX mesh execution to provide DLQ recovery, circuit breaking, heartbeat-based stuck detection, SLI tracking, safe-mode enforcement, and FSM checkpoint-based rewind recovery. It also extends the CLI and documentation to expose health/DLQ/recovery workflows and introduces two new test meshes to exercise the features.

Changes:

  • Integrate ReliabilityManager into the worker dispatcher for spawn gating, heartbeat tracking, DLQ routing/recovery, safe-mode tool blocking, and FSM checkpoint logging.
  • Add new reliability primitives (DLQ, circuit breaker, heartbeat monitor, SLI tracker, safe mode, checkpoint log) and expose CLI commands (tx mesh health|dlq|recover).
  • Add documentation and test meshes for reliability workflows and configuration.

Reviewed changes

Copilot reviewed 24 out of 25 changed files in this pull request and generated 13 comments.

Summary per file:

  • src/worker/dispatcher.ts — Initializes and wires reliability manager; adds safe-mode hook, heartbeat recording, spawn gating, DLQ routing/recovery triggers, FSM checkpoint saving.
  • src/reliability/sli-tracker.ts — New in-memory SLI tracker for success/failure rates, MTTR, and failure taxonomy.
  • src/reliability/safe-mode.ts — New safe-mode state machine for per-mesh/global tool/action restrictions based on SLI thresholds.
  • src/reliability/reliability-manager.ts — Coordinator tying together DLQ, circuit breaker, heartbeat, SLI, safe mode, and checkpoints; provides recovery and status APIs.
  • src/reliability/index.ts — Barrel exports for the reliability module.
  • src/reliability/heartbeat-monitor.ts — New heartbeat monitor for warn/stale/dead silence thresholds with callbacks.
  • src/reliability/dead-letter-queue.ts — New SQLite-backed DLQ with recovery modes and stats.
  • src/reliability/circuit-breaker.ts — New per-agent circuit breaker with SQLite checkpoint persistence.
  • src/reliability/checkpoint-log.ts — New SQLite-backed FSM checkpoint/session-id log enabling rewind-to recovery.
  • src/prompt/core.ts — Adds reliability & recovery guidance and the human-approval workflow to the core prompt.
  • src/cli/start.ts — Adds SIGUSR2 control action to trigger DLQ recovery through a running dispatcher.
  • src/cli/mesh.ts — Adds tx mesh health, tx mesh dlq, and tx mesh recover commands and --rewind-to parsing.
  • meshes/reliability-test/config.yaml — Adds linear 3-agent test mesh configured with tight reliability thresholds.
  • meshes/reliability-test/planner/prompt.md — Adds planner prompt for reliability-test mesh.
  • meshes/reliability-test/worker/prompt.md — Adds worker prompt for reliability-test mesh.
  • meshes/reliability-test/checker/prompt.md — Adds checker prompt for reliability-test mesh.
  • meshes/reliability-fsm/config.yaml — Adds FSM test mesh (analyze/build/verify/complete) for checkpoint/rewind testing.
  • meshes/reliability-fsm/analyst/prompt.md — Adds analyst prompt for reliability-fsm mesh.
  • meshes/reliability-fsm/builder/prompt.md — Adds builder prompt for reliability-fsm mesh.
  • meshes/reliability-fsm/verifier/prompt.md — Adds verifier prompt for reliability-fsm mesh.
  • docs/reliability.md — Adds comprehensive reliability documentation and operational workflows.
  • docs/guardrails.md — Documents reliability configuration and mechanisms in the guardrails guide.
  • docs/HUMAN_REVIEW.md — Adds human review gates and policies for reliability features.
  • .gitignore — Whitelists the two new reliability test meshes.
  • .claude/skills/mesh-builder/SKILL.md — Documents reliability-related front-matter fields for recovery/resume workflows.


Comment on lines +168 to +171
case 'half_open':
// Allow single probe request
return true;
}

Copilot AI Mar 11, 2026


In half_open state, canExecute() returns true for every call, which allows unlimited concurrent probe executions. That doesn’t match the “allow one probe request” behavior described in the header comment and in docs. Consider tracking an in-flight probe (or transitioning to open/closed after the first probe decision) so only one request is allowed until it succeeds/fails.
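One way to enforce a single in-flight probe, per this suggestion (a sketch with assumed names, not the module's actual code):

```typescript
type ProbeState = 'closed' | 'open' | 'half_open';

class HalfOpenProbe {
  state: ProbeState = 'half_open';
  private probeInFlight = false;

  canExecute(): boolean {
    if (this.state !== 'half_open') return this.state === 'closed';
    if (this.probeInFlight) return false; // only one probe at a time
    this.probeInFlight = true;
    return true;
  }

  // The probe's outcome decides the next state: success closes the
  // circuit, failure reopens it until the next cooldown.
  recordProbeResult(success: boolean): void {
    this.probeInFlight = false;
    this.state = success ? 'closed' : 'open';
  }
}
```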

Comment on lines +1197 to +1207
requeueMessage: (from: string, to: string, type: string, payload: Record<string, unknown>, extraFrontmatter?: Record<string, string>) => {
this.systemWriter.write({
from,
to,
type,
headline: (payload.headline as string) || 'DLQ recovery',
body: (payload.body as string) || '',
extraFrontmatter: { ...extraFrontmatter, ...Object.fromEntries(
Object.entries(payload).filter(([k]) => !['headline', 'body'].includes(k)).map(([k, v]) => [k, String(v)])
)},
});

Copilot AI Mar 11, 2026


requeueMessage writes session-id, resume-mesh, etc. into extraFrontmatter, but SystemMessageWriter.write() only inserts {msg-id, headline, body, filepath} into the SQLite queue payload. Since the dispatcher later reads nextMsg.payload['session-id'] / ['resume-mesh'], recovered messages won’t actually resume sessions or preserve mesh state. Consider ensuring these fields are also present in the queued payload (or change the dispatcher to parse frontmatter from source_file when deciding session continuation).

Comment on lines +5 to +10
* Monitor last output timestamps and escalate when stale.
*
* Stale detection levels:
* 1. Warning (60s no output): Log, could be thinking
* 2. Stale (120s no output): Inject nudge to worker
* 3. Dead (300s no output): Kill worker, route to DLQ

Copilot AI Mar 11, 2026


The module header says stale agents (120s silence) trigger a nudge injection, but HeartbeatMonitor only logs and exposes callbacks; and ReliabilityManager.bindDispatcher() currently just logs on stale. Either implement the nudge behavior at the stale threshold (via dispatcher bindings) or adjust the comment/docs to match the current behavior.

Suggested change
* Monitor last output timestamps and escalate when stale.
*
* Stale detection levels:
* 1. Warning (60s no output): Log, could be thinking
* 2. Stale (120s no output): Inject nudge to worker
* 3. Dead (300s no output): Kill worker, route to DLQ
* Monitor last output timestamps and surface health state so callers can react.
*
* Stale detection levels (as reported by this monitor):
* 1. Warning (60s no output): Mark as "warn" and log; worker may still be thinking
* 2. Stale (120s no output): Mark as "stale", log, and invoke callbacks so callers can nudge
* 3. Dead (300s no output): Mark as "dead", log, and invoke callbacks so callers can terminate / DLQ

Comment on lines +402 to +404
Messages that fail delivery after max retries are routed to the DLQ instead of being silently dropped. DLQ entries persist in SQLite and can be replayed manually.

- Automatic retry with exponential backoff

Copilot AI Mar 11, 2026


The DLQ section claims “automatic retry with exponential backoff”, but the implementation shown in this PR doesn’t include DLQ retry scheduling/backoff logic (DLQ is a persistence + recovery mechanism; worker retries are currently fixed-delay). Update the docs to reflect the actual behavior, or implement the described backoff mechanism.

Suggested change
Messages that fail delivery after max retries are routed to the DLQ instead of being silently dropped. DLQ entries persist in SQLite and can be replayed manually.
- Automatic retry with exponential backoff
Messages that fail delivery after max retries are routed to the DLQ instead of being silently dropped. DLQ entries persist in SQLite and can be replayed manually; the DLQ itself does not schedule retries or implement backoff.
- Integration with worker retries (fixed-delay; DLQ only after retries exhausted)


// Load config from config.yaml if exists
const fileConfig = this.loadConfigFromFile(workDir);
const merged = { ...fileConfig, ...config };

Copilot AI Mar 11, 2026


Config merging is shallow (const merged = { ...fileConfig, ...config }), so providing a partial nested override (e.g., { circuitBreaker: { cooldownMs: ... } }) will drop other circuitBreaker values from fileConfig. Consider a per-section merge (merged.circuitBreaker = { ...fileConfig.circuitBreaker, ...config?.circuitBreaker }, etc.) to avoid surprising config loss.

Suggested change
const merged = { ...fileConfig, ...config };
const merged: ReliabilityConfig = {
...fileConfig,
...config,
circuitBreaker: {
...(fileConfig?.circuitBreaker ?? {}),
...(config?.circuitBreaker ?? {}),
},
heartbeat: {
...(fileConfig?.heartbeat ?? {}),
...(config?.heartbeat ?? {}),
},
safeMode: {
...(fileConfig?.safeMode ?? {}),
...(config?.safeMode ?? {}),
},
dlq: {
...(fileConfig?.dlq ?? {}),
...(config?.dlq ?? {}),
},
sli: {
...(fileConfig?.sli ?? {}),
...(config?.sli ?? {}),
},
};

Comment on lines +1212 to +1219
// Recover any pending DLQ entries from previous crash
const dlqRecovery = this.reliability.recoverAll();
if (dlqRecovery.length > 0) {
log.info('dispatcher', 'DLQ startup recovery', {
attempted: dlqRecovery.length,
succeeded: dlqRecovery.filter(r => r.success).length,
failed: dlqRecovery.filter(r => !r.success).length,
});

Copilot AI Mar 11, 2026


recoverAll() is invoked unconditionally on startup. This will auto-recover any pending DLQ entries even if they were created by retry exhaustion (not just a crash), which conflicts with the “human approval required” guidance and can cause repeated recovery loops. Consider gating this behind an explicit config flag and/or only auto-recovering entries that were marked as crash-recovery eligible.

Suggested change
// Recover any pending DLQ entries from previous crash
const dlqRecovery = this.reliability.recoverAll();
if (dlqRecovery.length > 0) {
log.info('dispatcher', 'DLQ startup recovery', {
attempted: dlqRecovery.length,
succeeded: dlqRecovery.filter(r => r.success).length,
failed: dlqRecovery.filter(r => !r.success).length,
});
// Optionally recover any pending DLQ entries from previous crash.
// Controlled via WORKER_AUTO_RECOVER_DLQ env var to avoid reprocessing
// entries that may require human approval (e.g., retry exhaustion).
const shouldAutoRecoverDlq =
process.env.WORKER_AUTO_RECOVER_DLQ === 'true' ||
process.env.WORKER_AUTO_RECOVER_DLQ === '1';
if (shouldAutoRecoverDlq) {
const dlqRecovery = this.reliability.recoverAll();
if (dlqRecovery.length > 0) {
log.info('dispatcher', 'DLQ startup recovery', {
attempted: dlqRecovery.length,
succeeded: dlqRecovery.filter(r => r.success).length,
failed: dlqRecovery.filter(r => !r.success).length,
});
}

toAgent: agentId,
msgType: nextMsg?.type,
payload: nextMsg?.payload as Record<string, unknown>,
sourceFile: nextMsg?.source_file,

Copilot AI Mar 11, 2026


DLQ entries created on retry exhaustion aren’t being marked as such. DeadLetterQueue.add() decides recovery_mode using retry_count, but deadLetter() is called without passing the worker’s retryCount/maxRetries, so entries will default to retry_count=0 and often become auto-recoverable even after retries were exhausted. Pass machine.currentContext.retryCount (and ideally maxRetries) into the DLQ entry so exhausted cases correctly become manual and don’t auto-recover forever.

Suggested change
sourceFile: nextMsg?.source_file,
sourceFile: nextMsg?.source_file,
// Include retry metadata so DLQ can determine correct recovery mode
retryCount: machine.currentContext.retryCount,
maxRetries: machine.currentContext.maxRetries,
retry_count: machine.currentContext.retryCount,
max_retries: machine.currentContext.maxRetries,

Comment on lines +29 to +34
export interface SLIConfig {
/** How long to retain data in ms (default: 7 days) */
retentionMs: number;
/** Bucketing interval for rate calculations (default: 60000 = 1 min) */
bucketMs: number;
}

Copilot AI Mar 11, 2026


SLIConfig.bucketMs is defined and included in DEFAULT_CONFIG, but it’s never referenced. Either implement bucketing (e.g., to support rate over time windows without scanning all events) or remove the config field to avoid a misleading API surface.

const queue = new MessageQueue(queuePath);
const reliability = new ReliabilityManager(queue.getDb(), cwd);
const status = reliability.getStatus(300_000); // 5 min window


Copilot AI Mar 11, 2026


tx mesh health instantiates a fresh ReliabilityManager, but SLI, heartbeat health, and safe-mode state are currently in-memory only. As a result, this command will typically report 100% success with 0 events and default safe-mode/health, which is misleading. Either persist these metrics/state (SQLite tables, log-backed snapshots, etc.) or have mesh health query a running dispatcher (similar to the SIGUSR2 control path) so it reflects real runtime data.

Suggested change
// Note: ReliabilityManager keeps key state in-memory in the dispatcher.
// A freshly instantiated manager (as used here) may only see persisted data.
// If there are no recorded events, reporting "100% success" is misleading,
// so treat health as unavailable instead.
const sli: any = status.sli as any;
const totalEvents = sli && typeof sli.totalEvents === 'number' ? sli.totalEvents : undefined;
if (!totalEvents || totalEvents === 0) {
const warning = 'No reliability events recorded for this time window. ' +
'SLI/health/safe-mode state is unavailable outside a running dispatcher with in-memory state.';
if (flags.json) {
console.log(JSON.stringify({ ...status, warning, statusUnavailable: true }, null, 2));
return;
}
console.log();
console.log(chalk.bold('Reliability Health'));
console.log(chalk.dim('─'.repeat(50)));
console.log(chalk.yellow(warning));
return;
}

Comment on lines +77 to 85
} else if (arg.startsWith('--rewind-to=')) {
flags.rewindTo = arg.split('=')[1];
} else if (arg === '--rewind-to') {
// Next arg will be picked up as a positional, but we handle it here
const idx = args.indexOf(arg);
if (idx < args.length - 1 && !args[idx + 1].startsWith('-')) {
flags.rewindTo = args[idx + 1];
}
}

Copilot AI Mar 11, 2026


parseFlags() uses args.indexOf('--rewind-to') while iterating, which returns the first occurrence and can mis-read the value if flags repeat or ordering changes. Use an indexed loop (for i=0; i<args.length; i++) to reliably read the next arg.
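The fix can be sketched like this (function name assumed; the real parseFlags() handles more flags than shown here):

```typescript
// Indexed loop reliably reads the value following `--rewind-to`,
// even if flags repeat or ordering changes (unlike args.indexOf,
// which always returns the first occurrence).
function parseRewindTo(args: string[]): string | undefined {
  for (let i = 0; i < args.length; i++) {
    const arg = args[i];
    if (arg.startsWith('--rewind-to=')) {
      return arg.slice('--rewind-to='.length);
    }
    if (arg === '--rewind-to' && i + 1 < args.length && !args[i + 1].startsWith('-')) {
      return args[i + 1];
    }
  }
  return undefined;
}
```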

@eighteyes
Owner Author

@copilot open a new pull request to apply changes based on the comments in this thread

@eighteyes eighteyes merged commit c0845c5 into main Mar 11, 2026
4 checks passed

Copilot AI commented Mar 11, 2026

@eighteyes I've opened a new pull request, #3, to work on those changes. Once the pull request is ready, I'll request review from you.


4 participants